Re: Detecting if PDF contains only/mostly images.

Tilman Hausherr Thu, 09 Nov 2017 10:01:13 -0800

On 06.11.2017 at 16:12, Lachezar Dobrev wrote:

   Well… It worked (mostly) as expected.
   The thing I did not expect is that a fraction of the scanners
used turned out to be "smart"-ish. They attempt to perform OCR on the
scanned documents/images. They're actually doing a somewhat decent job
(I was impressed). The process, however, seems to result in weird PDFs
that contain multiple layers of images stacked on top of each other,
plus text (where it was detected) that is stacked on top of the
graphics and is *transparent* with a *transparent* background (as far
as I understand). The text is therefore invisible, but it can still be
select-copy-pasted, which is really nice.
   However that makes my job that much harder, since now bits and
pieces of the image are in different layers, and there *is* text
content.


   For the time being I am handling these by rendering the page to a
BufferedImage and then manually using ImageIO to encode the page as a
JPEG. The process seems to be very inefficient: a 124 KByte PDF file
ends up being converted to a 927 KByte JPEG image (Java Image IO @ 90%
quality). I have asked my colleagues to scan a test page that is
suitable for sharing (limited personal information), and I'm open to
suggestions on how to share it.


Save as PNG or (if b/w) as TIF.
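
The PNG suggestion can be tried next to the 90% JPEG path with plain
ImageIO. This is a minimal sketch using only the JDK; the class name
PageEncoder and its method names are hypothetical, and the page is
assumed to be already rendered into a BufferedImage.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

/** Hypothetical helper that encodes a rendered page as PNG or JPEG. */
class PageEncoder {

    /** Lossless encoding; usually compact for flat black/white or grey scans. */
    static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }

    /** Lossy encoding with an explicit quality between 0.0 and 1.0. */
    static byte[] toJpeg(BufferedImage image, float quality) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpeg");
        if (!writers.hasNext()) {
            throw new IOException("No JPEG writer available");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality);
            writer.setOutput(ios);
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return out.toByteArray();
    }
}
```

Comparing toPng(page).length with toJpeg(page, 0.9f).length on a real
scan shows which format is smaller for that kind of page. On Java 9 and
later, ImageIO also ships a TIFF writer, so the same toPng pattern with
the format name "tiff" may cover the black/white case.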

   So I'm looking for ways to improve. Is there any way I can:
   * Detect and skip text when it's transparent (PDFTextStripper)

tricky... you'd have to detect whether the font is invisible, or whether
it uses text rendering mode 3, or check the text color against the color
of the background.

   * Render the page to a BufferedImage, but detect the density from
the images in the page without the need to guess (currently guess-set
to 3*72 = 216 ppi).
   * Detect and possibly use colour space from the embedded images (to
skip colour for black-grey-white images)
   * (please suggest other items I may have overlooked)


Don't know... I think you can't win.
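
For the density question, the arithmetic itself is simple once the
image's pixel size and its placed size on the page are known: PDF user
space uses 72 points per inch by default. The sketch below is only that
arithmetic; the class and method names are hypothetical, and obtaining
the placed size is left to the PDF library (the PDFBox examples include
a PrintImageLocations class that reads it from the current
transformation matrix).

```java
/** Hypothetical helper: derives a scan's resolution from its pixel size and placed size. */
class ImageDensity {

    /** Default PDF user space unit: 1 point = 1/72 inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels per inch along one axis, given the pixel count and the placed length in points. */
    static double pixelsPerInch(int pixels, double points) {
        if (pixels <= 0 || points <= 0) {
            throw new IllegalArgumentException("pixels and points must be positive");
        }
        return pixels * POINTS_PER_INCH / points;
    }

    /** Takes the larger axis density so that neither axis is downsampled when rendering. */
    static double renderDpi(int widthPx, int heightPx, double widthPt, double heightPt) {
        return Math.max(pixelsPerInch(widthPx, widthPt), pixelsPerInch(heightPx, heightPt));
    }
}
```

For example, an image 2550 pixels wide placed across a 612 point wide
(US Letter) page gives 2550 * 72 / 612 = 300 ppi, which could then
replace the guessed 216 ppi when rendering the page.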

Tilman
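
For the colour question, one fallback that does not depend on the
embedded images' declared colour space is to inspect the rendered
pixels. This is a sketch of that swapped-in technique, not of a PDFBox
feature; the class name GrayDetector and the tolerance parameter are
assumptions made for illustration.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Hypothetical helper: checks whether a rendered page is effectively black-grey-white. */
class GrayDetector {

    /** True if, for every pixel, the R, G and B channels differ by at most the tolerance. */
    static boolean isGray(BufferedImage image, int tolerance) {
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return false; // a clearly coloured pixel was found
                }
            }
        }
        return true;
    }

    /** Redraws the image into an 8-bit greyscale buffer. */
    static BufferedImage toGray(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

A small non-zero tolerance allows for slight colour casts from the
scanner; a page that passes isGray can be converted with toGray before
encoding, which avoids storing three colour channels.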



2017-10-31 12:23 GMT+02:00 Tilman Hausherr <[email protected]>:

Heh heh... It's rather the opposite... it's a java library and the command
line tools are for convenience :-)

Tilman


On 31.10.2017 at 11:18, Lachezar Dobrev wrote:

    Ahh... You mean use the tool as a *ahm* tool?
    I'm so used to seeing these as parts of the command-line tools that
I've totally forgotten that their inner elements are suitable for use
in code. Thanks.

    I think I'm going to create a Writer implementation that throws an
exception if non-whitespace is written to it, and use
writeText(PDDocument,Writer) to quickly cancel processing when
non-whitespace is found.
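
The Writer idea described above could look roughly like this. It is a
minimal sketch using only the JDK; TextDetectingWriter, TextSource and
hasText are hypothetical names introduced here and are not part of
PDFBox.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical Writer that aborts as soon as a non-whitespace character is written. */
class TextDetectingWriter extends Writer {

    /** Signals that visible text was written. */
    static class TextFoundException extends IOException {
        TextFoundException(String message) {
            super(message);
        }
    }

    /** Anything that can write extracted text to a Writer. */
    @FunctionalInterface
    interface TextSource {
        void writeTo(Writer out) throws IOException;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            char c = cbuf[i];
            // isSpaceChar also covers non-breaking spaces, which isWhitespace excludes.
            if (!Character.isWhitespace(c) && !Character.isSpaceChar(c)) {
                throw new TextFoundException("Non-whitespace character written: " + c);
            }
        }
    }

    @Override
    public void flush() {
        // nothing is buffered
    }

    @Override
    public void close() {
        // nothing to release
    }

    /** Runs the source and reports whether it produced any non-whitespace output. */
    static boolean hasText(TextSource source) throws IOException {
        try {
            source.writeTo(new TextDetectingWriter());
            return false;
        } catch (TextFoundException e) {
            return true; // extraction stopped early at the first visible character
        }
    }
}
```

With PDFBox on the classpath, the intended call would be along the
lines of hasText(out -> new PDFTextStripper().writeText(document, out)).
Other I/O errors still propagate as ordinary IOExceptions, so only the
dedicated exception type is treated as "text found".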

2017-10-30 19:54 GMT+02:00 Tilman Hausherr <[email protected]>:

On 30.10.2017 at 16:52, Lachezar Dobrev wrote:

     I have been looking at it. I am actually using a similar approach
to read embedded bar-codes, but there I can test all images.
     The best I can see in ExtractImages is a way to check if there is
only one image. However, I cannot check if there is additional text or
other content, so I risk mistakenly skipping a page that has a
single logo (for instance) and lots of other text information.
     I tried looking at PDFTextStripper, but that is hard to follow.


That one is easy... just create the object, set start and end page, and
then call getText().

Tilman

     Is there any sure(-ish) sign that there is text on a page that I can
use? Can I check for the existence of something that would tell me
that there is additional content on the page other than the single
image?

2017-10-30 15:53 GMT+02:00 Tilman Hausherr <[email protected]>:

On 30.10.2017 at 14:04, Lachezar Dobrev wrote:

      I have to process PDF files that (supposedly) contain one big
image per page, which is the output of a document scanner. I'd like to
avoid performing PDF-To-Image in these cases, and use the underlying
image instead.
      I am not well-versed in all things PDF and have no idea how to
detect if a page has content other than a single image.
      Please advise.


Please have a look at the ExtractImages.java source code. You can
change that one to your needs.

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
