Re: How can i judge a PDF is a Scanned PDF?

Lachezar Dobrev Wed, 20 Nov 2024 03:42:05 -0800

Modern(-ish) scanners have an option to perform OCR on scanned documents. I've seen such PDF files that have a big image of the scanned documents as a back-ground, with lots of transparent text on top. That allows for the user to copy-paste text (OCR-ed) from such scanned documents.

I vaguely remember a scanner that used to split the page in rectangles with less-than-one-page images if there was much white space on the page.

On another note: there are many PDFs that contain pages with just one big image per page. Presentations occasionally...

To the original poster 'achilles': there is no reliable way to detect whether a PDF file is the result of a document scan process, or has been crafted. However, Ulf Dittmer's suggestion to look for pages with just a big image per page is (probably) the best option.
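A rough sketch of that heuristic in plain Java follows. It is only one way to encode the idea, not a PDFBox API: the `ScanHeuristic` class, the `PageStats` holder and both thresholds are hypothetical names and values made up for illustration. It assumes the per-page numbers have already been collected, for example the extracted text length from PDFTextStripper and the image XObjects from the page's PDResources. The coverage check is there for the OCR case described above, where a full-page image sits under a transparent text layer.

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox: it only encodes the heuristic.
final class ScanHeuristic {

    /** Per-page measurements, assumed to be gathered beforehand (e.g. with PDFBox). */
    static final class PageStats {
        final int textChars;               // characters of extractable text on the page
        final int imageCount;              // number of image XObjects on the page
        final double largestImageCoverage; // largest image area / page area, 0.0 to 1.0

        PageStats(int textChars, int imageCount, double largestImageCoverage) {
            this.textChars = textChars;
            this.imageCount = imageCount;
            this.largestImageCoverage = largestImageCoverage;
        }
    }

    /** True if a single page looks like a scan. */
    static boolean looksScanned(PageStats p, double minCoverage) {
        if (p.imageCount == 0) {
            return false; // no image at all: cannot be a scan
        }
        // Images but no text: the plain "scanned, no OCR" case.
        boolean imagesWithoutText = p.textChars == 0;
        // One image covering (almost) the whole page: also catches scans
        // that carry an OCR text layer on top of the image.
        boolean fullPageImage = p.largestImageCoverage >= minCoverage;
        return imagesWithoutText || fullPageImage;
    }

    /** True if at least minFraction of the pages look like scans. */
    static boolean documentLooksScanned(List<PageStats> pages,
                                        double minCoverage,
                                        double minFraction) {
        if (pages.isEmpty()) {
            return false;
        }
        int hits = 0;
        for (PageStats p : pages) {
            if (looksScanned(p, minCoverage)) {
                hits++;
            }
        }
        return (double) hits / pages.size() >= minFraction;
    }
}
```

This stays a guess either way: a presentation exported as one picture per page would be flagged as scanned, so the result is best treated as a hint rather than a verdict.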


On 20.11.24 at 12:49, Ulf Dittmer wrote:

I'm not quite sure what you mean by "scanned pdf", but if each page
basically consists of one image, and no text, that might be a strong
indication.

On Wed, 20 Nov 2024, 11:05 achilles, <1743702...@qq.com.invalid> wrote:

Hi:
  How can I judge whether a PDF is a scanned PDF using pdfbox?
  I can't find an API to judge whether a PDF is a scanned PDF without any
text. I need your help, thank you!



