I look at the y-coordinates of the characters in each line. Machine-generated characters on the same line generally have y-coordinates that are identical to 2 decimal places of a pixel. If the document was physically scanned, there will be slight variations, however carefully the operator places the paper. If the PDF was machine-generated and then published as an image, the antialiasing will cause random fluctuations of 0.1 pixel or even more.
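A rough sketch of that check, in Python for brevity. It assumes the per-character y-coordinates have already been extracted and grouped by text line (for example from the text positions a PDF text extractor reports). The function names, the 0.01 tolerance and the 0.9 cut-off are illustrative assumptions, not values I have tuned.

```python
def flat_line_fraction(lines, tolerance=0.01):
    """Fraction of text lines whose character y-coordinates agree.

    lines: list of lists of floats, one inner list of y-coordinates per line.
    tolerance: assumed maximum spread (in pixels) for a "flat" line.
    Lines with fewer than two characters carry no information and are skipped.
    """
    usable = [ys for ys in lines if len(ys) > 1]
    if not usable:
        return 0.0  # nothing to measure
    flat = sum(1 for ys in usable if max(ys) - min(ys) <= tolerance)
    return flat / len(usable)


def looks_machine_generated(lines, min_fraction=0.9, tolerance=0.01):
    """Heuristic only: True when most lines have a constant y-coordinate.

    min_fraction is a hypothetical cut-off; "generally identical" in practice
    means some lines (superscripts, mixed fonts) may legitimately vary.
    """
    return flat_line_fraction(lines, tolerance) >= min_fraction


if __name__ == "__main__":
    machine = [[700.0, 700.0, 700.0], [686.5, 686.5]]
    scanned = [[700.0, 700.12, 699.91], [686.5, 686.62]]
    print(looks_machine_generated(machine))
    print(looks_machine_generated(scanned))
```

Using a fraction of lines rather than requiring every line to be flat is a design choice to tolerate the odd superscript or font change; the threshold would need adjusting against real documents.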
Peter

On Sat, Nov 23, 2024 at 5:31 PM Brian L. Matthews <blmatth...@gmail.com> wrote:

> On 11/20/24 3:39 AM, Lachezar Dobrev wrote:
> >
> > To the original poster 'achilles': there is no reliable way to
> > detect whether PDF file is a result of a document scan process, or has
> > been crafted. However Ulf Dittmer's suggestion to look for pages with
> > just a big image per page is (probably) the best option.
>
> I kind of do the opposite, extract the text and if there's more than a
> certain amount of it, treat it as a true PDF, not scanned (actually I do
> what the text extractor does, but bail out when I hit the threshold
> amount of text. You could also bail out if you see a certain number of
> pages with no text.)
>
> Brian

--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK