On 03.05.2019 at 17:23, Luca Loiodice wrote:
Excellent, looks promising, thanks a lot for your help!
A related question (still in the area of low-quality extracted text):
would it also be possible to detect which characters are drawn with a
font that has no Unicode mapping? I generally know how to detect, for
example, whether a PDF has a Type 3 font with no Unicode
mapping, but sometimes that font is used for only a small portion of the
characters on the page, and I wanted to handle those characters specially.
Thanks again

You could check whether getUnicode() is null or empty; that would be the
easiest. Or get the font, call getCOSObject() and check whether a
ToUnicode entry exists. (However, sometimes ToUnicode is missing
but getUnicode() returns something anyway... "it's complicated".)
Tilman
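A minimal sketch of how the two checks from Tilman's reply could be combined per character. It is deliberately independent of PDFBox so that it runs on its own: the class name, the enum, and the method are hypothetical, and the two inputs are assumed to come from getUnicode() on the text position and from a lookup of the ToUnicode entry in the dictionary returned by the font's getCOSObject(). The middle case covers the "it's complicated" situation where text comes back although ToUnicode is missing.

```java
// Sketch only: UnicodeMappingCheck and Status are hypothetical names, not PDFBox API.
final class UnicodeMappingCheck {

    enum Status {
        NO_TEXT,        // getUnicode() gave null or an empty string
        FALLBACK_TEXT,  // text is present although the font has no ToUnicode entry
        MAPPED          // text is present and the font has a ToUnicode entry
    }

    // unicode:      the value getUnicode() returned for the character
    // hasToUnicode: whether the font dictionary (from getCOSObject())
    //               contains a ToUnicode entry
    static Status classify(String unicode, boolean hasToUnicode) {
        if (unicode == null || unicode.isEmpty()) {
            return Status.NO_TEXT;
        }
        return hasToUnicode ? Status.MAPPED : Status.FALLBACK_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(classify(null, false));
        System.out.println(classify("A", false));
        System.out.println(classify("A", true));
    }
}
```

Characters classified as NO_TEXT or FALLBACK_TEXT are the candidates for special handling; whether FALLBACK_TEXT can be trusted depends on the font and would need checking against real documents.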
On Fri, May 3, 2019 at 10:07 AM Tilman Hausherr <[email protected]>
wrote:
These answers may help:
https://stackoverflow.com/questions/50044892/pdfbox-invisible-text-from-pdftextstripper-not-clip-path-or-color-issue
https://stackoverflow.com/questions/50487520/pdfbox-2-0-invisible-text-from-pdftextstripper
Tilman
On 03.05.2019 at 17:02, Luca Loiodice wrote:
Hello,
I need to remove the (often low-quality) invisible text that OCR tools
place on images to make PDFs searchable.
We use PDFBox ourselves to make searchable PDFs, and we use
setRenderingMode(RenderingMode.NEITHER); when we place the text to
make it invisible. We also use PDFBox's text stripper to remove text from
PDFs.
What I am not sure about is whether there is a way for the text stripper
to identify the characters that have been placed as invisible and remove
only those in some cases.
Thanks for your help,
Luca
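A minimal sketch of the filtering step asked about here, assuming the text rendering mode of each character has already been collected. In the PDF specification, mode 3 means "neither fill nor stroke", which is what RenderingMode.NEITHER expresses. The Glyph class and both methods are hypothetical; in PDFBox the mode would presumably be read from the graphics state's text state while the text stripper processes each character, which is an assumption here and not something this sketch demonstrates.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: InvisibleTextFilter and Glyph are hypothetical names, not PDFBox API.
final class InvisibleTextFilter {

    // PDF text rendering mode 3: neither fill nor stroke, i.e. invisible text.
    static final int MODE_NEITHER = 3;

    // One extracted character together with the rendering mode it was drawn in.
    static final class Glyph {
        final String text;
        final int renderingMode;

        Glyph(String text, int renderingMode) {
            this.text = text;
            this.renderingMode = renderingMode;
        }
    }

    static boolean isInvisible(Glyph glyph) {
        return glyph.renderingMode == MODE_NEITHER;
    }

    // Concatenates the text of all glyphs that were not drawn invisibly.
    static String visibleText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph glyph : glyphs) {
            if (!isInvisible(glyph)) {
                sb.append(glyph.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 0));   // filled, visible
        glyphs.add(new Glyph("x", 3));   // invisible OCR layer
        glyphs.add(new Glyph("i", 0));   // filled, visible
        System.out.println(visibleText(glyphs));
    }
}
```

This only covers text hidden through the rendering mode. Text hidden by other means (clipping, being covered by an image, matching the background colour) would need separate checks.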
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]