Hi Jeremias, I just used the PDFDebugger and it actually makes a lot of sense now! :-)! Thanks!
Jochen

On 10 Jul 2012, at 20:32, Jeremias Maerki wrote:

> Hi Jochen,
>
> there is no "extra text layer", not even a "text layer" in PDF. Text
> painting operators are just operators like those for painting lines,
> curves and bitmaps.
>
> When I wrote that OCR programs write white-on-white text behind the
> scanned bitmap, that is usually the result of text operators being
> placed before the painting of the bitmap. Thus the bitmap basically lies
> over the text because it was painted after the text was painted. But
> there is probably no actual "layer". The so-called "optional content
> groups" (OCG, since PDF 1.5) are sometimes used to create something like
> a "layer" which can be disabled and enabled etc. Good OCR programs
> probably create an OCG if the text
>
> If you want to know if a PDF is OCRed, just run PDFBox's text extraction.
> If you get no text, you can probably try to run the OCR process. The
> result of running OCR on an OCRed PDF is application-specific. There's
> no single answer for that.
>
> Here's an extract (with my comments) of a scanned page that I've run
> through Readiris Pro 12 (not the best OCR tool BTW):
>
> BT                      % begin text object
> 3 Tr                    % text rendering mode 3: neither fill nor stroke (invisible)
> 1 0 0 1 0 846 Tm        % text matrix (position, scale...)
> 138.48 -58.56 Td        % move text position
> /F00 23 Tf              % select font /F00, size 23 (internally mapped to TimesNewRoman)
> (FS) Tj                 % write "FS"
> 34.8 0 Td               % move text position
> (Hotel-) Tj             % write "Hotel-" etc. etc.
> 75.12 0 Td
> (Stuttgart) Tj
> 93.6 0 Td
> (-) Tj
> 13.44 0 Td
> (Böblingen) Tj
> 56.4 -65.28 Td
> /F10 9 Tf
> (Wolf) Tj
> 23.04 0 Td
> (-) Tj
> 8.88 0 Td
> (Hirth) Tj
> 23.28 0 Td
> (-) Tj
> 8.64 0 Td
> (Straße) Tj
>
> [.....]
>
> 6 0 Td                  % move text position
> (Stuttgart) Tj          % write "Stuttgart"
> ET                      % end text object
> q                       % save graphics state
> 601.92 0 0 846 0 0 cm   % concatenate transformation matrix (position, scale etc.)
> /img0 Do                % paint bitmap /img0 (the scanned page)
> Q                       % restore graphics state
>
> So, just a bitmap painted over the recognized text. No layers, they
> didn't even bother to paint the text in white.
>
> Jochen, fire up PDFBox's PDFDebugger [1] and load a few PDFs and browse
> through the object tree. Look around. That'll give you a feeling of
> what's in a PDF. Then download the PDF specification. It's not written
> in hieroglyphs or Klingon. ;-)
>
> [1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
>
> Jeremias Maerki
>
>
> On 10.07.2012 19:41:24 Jochen Hebbrecht wrote:
>> Hi Jeremias,
>>
>> No, I'm not having any trouble at all :-). Just curious about the working
>> mechanism of PDFBox. And how Adobe created its PDF format.
>> On this page
>> (http://en.wikipedia.org/wiki/Portable_Document_Format#Adobe.27s_versions),
>> you can see all previous (and current) versions of the PDF format. Do any
>> of these versions support the text layer? What does Adobe call this "extra
>> text layer"? There's no information on Wikipedia telling me the technical
>> details about this "text layer".
>>
>> Can we detect using PDFBox if an image has been OCR'd? Or do we just try to
>> get the contents? And if the content is null, try to OCR with some kind of
>> OCR engine?
>>
>> And what happens if we try to OCR a PDF which was already OCR'd? Do we have
>> an extra "text layer"? So 1 image, 1 layer with first OCR and 1 layer with
>> secondary OCR?
>>
>> Jochen
>>
>>
>> -----Original Message-----
>> From: Jeremias Maerki [mailto:[email protected]]
>> Sent: Tuesday, 10 July 2012 16:11
>> To: [email protected]
>> Subject: Re: How does PDFBox extract text from a PDF?
>>
>>
>> On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
>>> My first question is: how is text stored in a PDF?
>>> I think there are 2 ways to store text in a PDF:
>>> a) vector PDF: the PDF contains a line telling it to print a word in a
>>> specific font on a specific location
>>
>> That's the usual case, yes.
>>
>>> b) OCR text has been added to the image as an extra layer (I think
>>> this is called the XMP metadata)
>>
>> No, actually OCR software usually just adds white-on-white text behind
>> the bitmap. This would technically be like your a).
>>
>> XMP Metadata is really just for metadata, not actual text content.
>>
>>> Is this information correct?
>>>
>>> So, if PDFBox wants to extract text from a PDF, how does it extract
>>> the data? Is it looking at the XMP metadata? Or the vector details?
>>> Any developer wanting to help me on this issue?
>>
>> PDFBox interprets the text painting operators (as if it were painting the
>> PDF), looks up the actual character for a code point (character "a"
>> might be at code point 7 (or whatever) when a subset CID font is used, for
>> example) and emits that as Unicode text. Well, that's simplified.
>> There are some additional heuristics for things like placement and order of
>> text but that doesn't really affect the actual process of extracting text.
>>
>> There is another location where a PDF can carry text but that's not
>> supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can
>> contain text of artifacts on a page (e.g. an image). That's used for enabling
>> visually impaired people to read certain documents.
>>
>> I guess the question is: what are you trying to do? Do you have a problem
>> you're trying to solve?
>>
>> If you want to learn about how text is put into a PDF, run PDFBox's
>> PDFDebugger and open a random PDF. That allows you to explore all the
>> details of a PDF. Quite enlightening if you don't know the PDF specification
>> by heart.
>>
>> Jeremias Maerki
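
To make the "interpret the text painting operators" idea from the thread concrete, here is a minimal Python sketch that walks an already-decoded content stream like the Readiris extract quoted above and records what gets painted, in order. It is not PDFBox's implementation, and every name in it (`TOKEN`, `paint_events`, `extract_text`, `looks_like_hidden_ocr_text`, `SAMPLE`) is hypothetical. It assumes simple literal strings without escapes or nesting, ignores `TJ` arrays, fonts and positions, does not honour `q`/`Q` for the text state, and treats the string bytes as the text itself, whereas a real extractor would map character codes to Unicode through the font. The "hidden text" check only looks at paint order and rendering mode, not at geometry.

```python
import re

# Simplified tokenizer. Assumption: no escaped/nested parentheses, hex
# strings, arrays, dictionaries or inline images in the stream.
TOKEN = re.compile(
    r"%[^\n]*"                     # comment up to end of line (ignored)
    r"|\((?P<str>[^()]*)\)"        # literal string, e.g. (Hotel-)
    r"|(?P<name>/[^\s()/%]+)"      # name, e.g. /F00 or /img0
    r"|(?P<num>[-+]?\d*\.?\d+)"    # number, e.g. 138.48 or -58.56
    r"|(?P<op>[A-Za-z*']+)"        # operator, e.g. BT, Tj, Do
)


def paint_events(stream):
    """Return (kind, value, render_mode) tuples in painting order."""
    operands, events = [], []
    render_mode = 0                # PDF default text rendering mode: fill
    for match in TOKEN.finditer(stream):
        kind = match.lastgroup
        if kind is None:           # a comment matched, nothing to record
            continue
        if kind != "op":           # operands precede their operator
            operands.append(match.group(kind))
            continue
        op = match.group("op")
        if op == "Tr":             # set text rendering mode (3 = invisible)
            render_mode = int(operands[-1])
        elif op == "Tj":           # show a text string
            events.append(("text", operands[-1], render_mode))
        elif op == "Do":           # paint an XObject (here: the scanned bitmap)
            events.append(("xobject", operands[-1], None))
        operands.clear()           # all other operators are skipped
    return events


def extract_text(events):
    """Join the shown strings in painting order (no layout heuristics)."""
    return " ".join(value for kind, value, _ in events if kind == "text")


def looks_like_hidden_ocr_text(events):
    """Heuristic: every text run is invisible or painted before a later Do."""
    text_idx = [i for i, e in enumerate(events) if e[0] == "text"]
    xobj_idx = [i for i, e in enumerate(events) if e[0] == "xobject"]
    if not text_idx:
        return False
    last_xobj = max(xobj_idx, default=-1)
    return all(events[i][2] == 3 or i < last_xobj for i in text_idx)


SAMPLE = """
BT                      % begin text object
3 Tr
1 0 0 1 0 846 Tm
138.48 -58.56 Td
/F00 23 Tf
(FS) Tj
34.8 0 Td
(Hotel-) Tj
75.12 0 Td
(Stuttgart) Tj
ET
q
601.92 0 0 846 0 0 cm
/img0 Do                % the scanned page
Q
"""

events = paint_events(SAMPLE)
```

With this sample, `extract_text(events)` gives `"FS Hotel- Stuttgart"` and `looks_like_hidden_ocr_text(events)` is true, which matches the "bitmap painted over the recognized text" observation. For real files, running PDFBox's own text extraction and checking for an empty result, as suggested in the thread, remains the practical test.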

