A sure sign that the text is the product of OCR is that it is rendered in
mode 3 (operator "3 Tr"), i.e. invisible. See the PDF 1.7 specification
(ISO 32000-1:2008), section 9.3.6.

Unless the PDF producer adds some kind of visible watermark using text, all
text in such a file will be set to render in this mode.
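
To make the check concrete, here is a rough sketch in plain Java (no PDFBox
calls). It assumes you have already obtained a page's decoded content stream
as a String by whatever means; the class and method names
(InvisibleTextCheck, scan, allTextInvisible) are made up for illustration.
It tracks the current text rendering mode and counts how many text-showing
operators run while the mode is 3. It does not follow form XObjects and does
not skip inline image data (BI ... ID ... EI), so treat it as a heuristic
rather than a complete content stream parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: scans a decoded PDF content stream for text that is
// shown while the text rendering mode is 3 (invisible).
class InvisibleTextCheck {

    static final class Result {
        final int shown;      // text-showing operators seen (Tj, TJ, ', ")
        final int invisible;  // how many of those ran in rendering mode 3
        Result(int shown, int invisible) {
            this.shown = shown;
            this.invisible = invisible;
        }
        // True only if there is text and every piece of it is invisible.
        boolean allTextInvisible() {
            return shown > 0 && shown == invisible;
        }
    }

    private static final String DELIMS = "()<>[]{}/%";

    static Result scan(String content) {
        int mode = 0;                               // default mode is 0 (fill)
        Deque<Integer> saved = new ArrayDeque<>();  // modes saved by q, restored by Q
        String prev = null;                         // previous token, possible operand
        int shown = 0, invisible = 0;
        int i = 0, n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c) || c == 0) { i++; continue; }
            if (c == '%') {                         // comment runs to end of line
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
                continue;
            }
            if (c == '(') {                         // literal string, may nest
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') i++;             // skip the escaped character too
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
                prev = null;
                continue;
            }
            if (c == '<') {
                if (i + 1 < n && content.charAt(i + 1) == '<') {
                    i += 2;                         // dictionary start "<<"
                } else {                            // hex string "<...>"
                    while (i < n && content.charAt(i) != '>') i++;
                    i++;
                }
                prev = null;
                continue;
            }
            if (c == '>' || c == '[' || c == ']' || c == '{' || c == '}') {
                i++;
                prev = null;
                continue;
            }
            int start = i++;                        // number, name or operator
            while (i < n && !Character.isWhitespace(content.charAt(i))
                    && DELIMS.indexOf(content.charAt(i)) < 0) i++;
            String tok = content.substring(start, i);
            switch (tok) {
                case "Tr":
                    try {
                        mode = Integer.parseInt(prev);
                    } catch (NumberFormatException e) {
                        // missing or malformed operand: keep the current mode
                    }
                    break;
                case "q": saved.push(mode); break;
                case "Q": if (!saved.isEmpty()) mode = saved.pop(); break;
                case "Tj": case "TJ": case "'": case "\"":
                    shown++;
                    if (mode == 3) invisible++;
                    break;
                default: break;
            }
            prev = tok;
        }
        return new Result(shown, invisible);
    }
}
```

If scan(...).allTextInvisible() is true for every page that has text, the
text layer was most likely added by OCR. A page with an image and no text at
all returns false here and needs the "one big image per page" check instead.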

C.D.

--
There is a computer disease that anybody who works with computers knows
about. It's a very serious disease and it interferes completely with the
work. The trouble with computers is that you 'play' with them!
- Richard P. Feynman


On Wed, Nov 20, 2024 at 1:41 PM Lachezar Dobrev <l.dob...@gmail.com> wrote:

>    Modern(-ish) scanners have an option to perform OCR on scanned
> documents. I've seen such PDF files that have a big image of the scanned
> document as a background, with lots of transparent text on top. That
> allows the user to copy and paste the OCR-ed text from such scanned
> documents.
>    I vaguely remember a scanner that used to split the page into
> rectangles with less-than-one-page images if there was much white space
> on the page.
>    On another note: there are many PDFs that contain pages with just one
> big image per page. Presentations occasionally...
>
>    To the original poster 'achilles': there is no reliable way to detect
> whether a PDF file is the result of a document scan process or has been
> crafted. However, Ulf Dittmer's suggestion to look for pages with just a
> big image per page is (probably) the best option.
>
> On 20.11.24 at 12:49, Ulf Dittmer wrote:
> > I'm not quite sure what you mean by "scanned pdf", but if each page
> > basically consists of one image, and no text, that might be a strong
> > indication.
> >
> > On Wed, 20 Nov 2024, 11:05 achilles, <1743702...@qq.com.invalid> wrote:
> >
> >> hi:
> >>   How can I tell whether a PDF is a scanned PDF using PDFBox?
> >>   I can't find an API to tell whether a PDF is a scanned PDF without any
> >> text. I need your help, thank you!
> >
