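You can use PDFTextStripper <https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html> to find out whether there is "real" text on a single page or in the entire file: restrict it to the page you care about with setStartPage() and setEndPage(), call getText(), and check whether the result is empty.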
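A minimal sketch of that check, assuming PDFBox 2.0.x is on the classpath. The class name PageTextCheck and the helper pageHasText are made up for illustration and are not part of PDFBox. Note that a scan that has been run through OCR usually carries an invisible text layer, which PDFTextStripper will also report as text.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, not part of PDFBox.
public class PageTextCheck {

    /**
     * Returns true if PDFTextStripper extracts any non-whitespace text
     * from the given page. Page numbers are 1-based, as in PDFTextStripper.
     */
    static boolean pageHasText(PDDocument doc, int pageNumber) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(pageNumber);
        stripper.setEndPage(pageNumber);
        String text = stripper.getText(doc);
        // A page without text typically yields only line separators.
        return !text.trim().isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file to inspect.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                String verdict = pageHasText(doc, p) ? "has text" : "no extractable text";
                System.out.println("page " + p + ": " + verdict);
            }
        }
    }
}
```

Pages reported as "no extractable text" are candidates for the direct image extraction route. This only answers the text question; vector graphics or several images on the same page would need a separate check.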
On Mon, Oct 30, 2017 at 4:52 PM, Lachezar Dobrev <[email protected]> wrote:

> I have been looking at it. I am actually using (a similar) approach
> to read embedded bar-codes, but there I can test all images.
> The best I can see in ExtractImages is a way to check if there is
> only one image. However I can not check if there is additional text or
> other content, so that I do not mistakenly skip a page that has a
> single logo (for instance) and lots of other text information.
> I tried looking at PDFTextStripper, but that is hard to follow.
>
> Is there any sure(-ish) sign that there is text on a page that I can
> use? Can I check for the existence of something that would tell me
> that there is additional content on the page other than the single
> image?
>
> 2017-10-30 15:53 GMT+02:00 Tilman Hausherr <[email protected]>:
> > On 30.10.2017 at 14:04, Lachezar Dobrev wrote:
> >>
> >> I have to process PDF files, that (supposedly) contain one big image
> >> per page, which is a result from a Document-Scanner. I'd like to avoid
> >> performing PDF-To-Image in these cases, and use the underlying image
> >> instead.
> >> I am not well-versed in all things PDF and have no idea how to
> >> detect if a page has content other than a single image.
> >> Please advise.
> >
> > Please have a look at the ExtractImages.java source code. You can change
> > that one to your needs.
> >
> > Tilman
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]

