I look at the y-coordinates of the characters in each line.
Machine-generated characters on the same line generally have y-coordinates
that are identical to 2 decimal places of a pixel. If the document has been
physically scanned there will be slight variations, however well the
operator places the paper. If the PDF is machine-generated but published as
an image, the antialiasing will cause random fluctuations of 0.1 pixel
or even more.
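
As a rough illustration of the check described above, here is a minimal
sketch. It assumes the y-coordinates have already been collected and grouped
by text line (for example from TextPosition.getYDirAdj() in a
PDFTextStripper subclass that overrides writeString). The class name
BaselineJitter and the 0.01 tolerance are my own assumptions, not part of
PDFBox, and superscripts or subscripts would need to be filtered out first.

```java
import java.util.List;

/** Hypothetical helper, not part of PDFBox: baseline-jitter check for one page. */
final class BaselineJitter {

    /** Largest spread (max - min) of glyph y-coordinates within any single text line. */
    static double maxSpread(List<double[]> lines) {
        double worst = 0.0;
        for (double[] ys : lines) {
            if (ys.length < 2) {
                continue; // a single glyph has no spread to measure
            }
            double min = ys[0];
            double max = ys[0];
            for (double y : ys) {
                min = Math.min(min, y);
                max = Math.max(max, y);
            }
            worst = Math.max(worst, max - min);
        }
        return worst;
    }

    /** True when the glyphs on at least one line wander by more than the tolerance. */
    static boolean looksScanned(List<double[]> lines, double tolerance) {
        return maxSpread(lines) > tolerance;
    }
}
```

Calling BaselineJitter.looksScanned(lines, 0.01) on a page whose lines all
share one baseline returns false; a line with y-values such as 100.00,
100.12 and 99.95 returns true.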

Peter

On Sat, Nov 23, 2024 at 5:31 PM Brian L. Matthews <blmatth...@gmail.com>
wrote:

> On 11/20/24 3:39 AM, Lachezar Dobrev wrote:
> >
> >   To the original poster 'achilles': there is no reliable way to
> > detect whether a PDF file is the result of a document scan process or has
> > been crafted. However, Ulf Dittmer's suggestion to look for pages with
> > just one big image per page is (probably) the best option.
>
> I kind of do the opposite: extract the text and, if there's more than a
> certain amount of it, treat it as a true PDF, not scanned. (Actually I do
> what the text extractor does, but bail out when I hit the threshold
> amount of text. You could also bail out if you see a certain number of
> pages with no text.)
>
> Brian
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
>
>
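
Brian's early-exit idea quoted above could be sketched roughly like this.
It is only an illustration under my own assumptions: the page texts arrive
through a lazy Iterator (so pages after the bail-out point are never
extracted), and the class name TextThreshold and both thresholds are
hypothetical, not anything from PDFBox or from Brian's code.

```java
import java.util.Iterator;

/** Hypothetical helper, not part of PDFBox: early-exit text-amount heuristic. */
final class TextThreshold {

    enum Verdict { TEXT_PDF, SCANNED }

    /**
     * Consumes page texts one at a time and stops as soon as either
     * threshold is reached, so later pages are never requested.
     */
    static Verdict classify(Iterator<String> pageTexts, int minChars, int maxEmptyPages) {
        int chars = 0;
        int emptyPages = 0;
        while (pageTexts.hasNext()) {
            String text = pageTexts.next().trim();
            if (text.isEmpty()) {
                emptyPages++;
                if (emptyPages >= maxEmptyPages) {
                    return Verdict.SCANNED; // enough pages with no text at all
                }
            } else {
                chars += text.length();
                if (chars >= minChars) {
                    return Verdict.TEXT_PDF; // enough real text, stop extracting
                }
            }
        }
        // Document ended before either threshold was hit: too little text.
        return Verdict.SCANNED;
    }
}
```

With real documents the iterator would wrap per-page text extraction; the
fallback to SCANNED for short documents is a design choice that may need
tuning for one-page PDFs with very little text.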

-- 
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
