Well… It worked (mostly) as expected. What I did not expect is that a fraction of the scanners in use turned out to be "smart"-ish: they attempt to perform OCR on the scanned documents/images, and they actually do a somewhat decent job (I was impressed). The process, however, seems to result in weird PDFs that contain multiple layers of images stacked on top of each other, plus text (where it was detected) stacked on top of the graphics. As far as I understand, that text is *transparent* with a *transparent* background, so it is invisible but can still be selected, copied and pasted, which is really nice. However, that makes my job that much harder, since now bits and pieces of the image are in different layers, and there *is* text content.
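The white-space check proposed further down in the quoted thread (a Writer that throws as soon as non-white space is written to it) might look roughly like the sketch below. It also illustrates why the invisible OCR text is a problem: any such text would trip the check. The class name `WhitespaceOnlyWriter`, the exception type and the `containsText` helper are made up for illustration. In real use the Writer would be handed to the `writeText(PDDocument,Writer)` call mentioned in the thread instead of being fed a plain string.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical Writer that aborts on the first non-white-space character. */
class WhitespaceOnlyWriter extends Writer {

    /** Signals that real text was written (name is an assumption). */
    static class TextFoundException extends IOException {
        TextFoundException(String message) {
            super(message);
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            if (!Character.isWhitespace(cbuf[i])) {
                // Stop immediately: no need to look at the rest of the text.
                throw new TextFoundException("non-white-space character: " + cbuf[i]);
            }
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Stand-in for the extraction step: feeds already extracted text to the Writer. */
    static boolean containsText(String extracted) {
        try (Writer w = new WhitespaceOnlyWriter()) {
            w.write(extracted);
            return false;
        } catch (TextFoundException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(containsText(" \n\t"));        // white space only
        System.out.println(containsText("  Invoice 42")); // contains characters
    }
}
```

Note that `Character.isWhitespace` does not treat a non-breaking space as white space, so a page containing only such characters would still count as having text.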
For the time being I am handling these by rendering the page to a BufferedImage and then using ImageIO manually to write the page as a JPEG. The process seems to be very inefficient: a 124 KByte PDF file ends up being converted to a 927 KByte JPEG image (Java Image IO @ 90% quality).
I have asked my colleagues to scan a test page that is suitable for sharing (limited personal information); I'm open to suggestions on how to share it.

So I'm looking for ways to improve. Is there any way I can:
* Detect and skip text when it's transparent (PDFTextStripper)
* Render the page to a BufferedImage, but detect the density from the images in the page without the need to guess (currently guess-set to 3*72 = 216 ppi).
* Detect and possibly use the colour space from the embedded images (to skip colour for black-grey-white images)
* (please suggest other items I may have overlooked)

2017-10-31 12:23 GMT+02:00 Tilman Hausherr <[email protected]>:
> Heh heh... It's rather the opposite... it's a java library and the command
> line tools are for convenience :-)
>
> Tilman
>
>
> On 31.10.2017 at 11:18, Lachezar Dobrev wrote:
>>
>> Ahh... You mean use the tool as a *ahm* tool?
>> I'm so used to seeing these as parts of the command-line tools that
>> I've totally forgotten that their inner elements are suitable for use
>> in code. Thanks.
>>
>> I think I'm going to create a Writer implementation that throws
>> exception if non-white space is written to it, and use the
>> writeText(PDDocument,Writer) to quickly cancel processing when
>> non-white space is found.
>>
>> 2017-10-30 19:54 GMT+02:00 Tilman Hausherr <[email protected]>:
>>>
>>> On 30.10.2017 at 16:52, Lachezar Dobrev wrote:
>>>>
>>>> I have been looking at it. I am actually using (a similar) approach
>>>> to read embedded bar-codes, but there I can test all images.
>>>> The best I can see in ExtractImages is a way to check if there is
>>>> only one image.
>>>> However I can not check if there is additional text or
>>>> other content, so that I do not mistakenly skip a page that has a
>>>> single logo (for instance) and lots of other text information.
>>>> I tried looking at PDFTextStripper, but that is hard to follow.
>>>
>>>
>>> That one is easy... just create the object, set start and end page, and
>>> then call getText().
>>>
>>> Tilman
>>>
>>>
>>>> Is there any sure(-ish) sign that there is text on a page that I can
>>>> use? Can I check for the existence of something that would tell me
>>>> that there is additional content on the page other than the single
>>>> image?
>>>>
>>>> 2017-10-30 15:53 GMT+02:00 Tilman Hausherr <[email protected]>:
>>>>>
>>>>> On 30.10.2017 at 14:04, Lachezar Dobrev wrote:
>>>>>>
>>>>>> I have to process PDF files, that (supposedly) contain one big
>>>>>> image per page, which is a result from a Document-Scanner. I'd like
>>>>>> to avoid performing PDF-To-Image in these cases, and use the
>>>>>> underlying image instead.
>>>>>> I am not well-versed in all things PDF and have no idea how to
>>>>>> detect if a page has content other than a single image.
>>>>>> Please advise.
>>>>>
>>>>>
>>>>> Please have a look at the ExtractImages.java source code. You can
>>>>> change that one to your needs.
>>>>>
>>>>> Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
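A minimal sketch of the JPEG step described at the top of this message, using only the JDK. It is not a PDFBox recipe: it does not read the colour space or resolution from the PDF itself. Instead it (a) computes the ppi implied by an image that spans the full page width, assuming the pixel width and the page width in points are already known, and (b) scans the rendered pixels to decide whether the page is effectively grey, then converts it to an 8-bit grey image before encoding. The class name `ScanJpegSketch`, all method names and the tolerance parameter are assumptions made for illustration.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

class ScanJpegSketch {

    /** ppi implied by an image covering the full page width (72 points = 1 inch). */
    static float impliedPpi(int imageWidthPixels, float pageWidthPoints) {
        return imageWidthPixels * 72f / pageWidthPoints;
    }

    /** True if R, G and B differ by at most 'tolerance' in every pixel. */
    static boolean isGrey(BufferedImage img, int tolerance) {
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (Math.abs(r - g) > tolerance || Math.abs(g - b) > tolerance
                        || Math.abs(r - b) > tolerance) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Copies the image into a single-channel 8-bit grey image. */
    static BufferedImage toGrey(BufferedImage img) {
        BufferedImage grey = new BufferedImage(
                img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = grey.createGraphics();
        try {
            g.drawImage(img, 0, 0, null);
        } finally {
            g.dispose();
        }
        return grey;
    }

    /** Encodes the image as JPEG with an explicit quality between 0 and 1. */
    static byte[] writeJpeg(BufferedImage img, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(img, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** Drops colour when the page is effectively grey, then encodes. */
    static byte[] encodePage(BufferedImage page, float quality) throws IOException {
        BufferedImage source = isGrey(page, 2) ? toGrey(page) : page;
        return writeJpeg(source, quality);
    }
}
```

For example, an image 1800 pixels wide that fills a page 600 points wide gives `impliedPpi(1800, 600f)` = 216, which matches the 3*72 guess above. The computed value could then be passed to the renderer instead of a fixed number. Dropping the colour channels may save some space on black-grey-white scans, but how much depends on the content.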

