Excellent, looks promising, thanks a lot for your help! A related question (still in the area of low-quality extracted text): would it also be possible to detect which characters are drawn with a font that has no Unicode mapping? I generally know how to detect whether a PDF contains, for example, a Type 3 font with no Unicode mapping, but sometimes that font is used for only a small portion of the characters on the page, and I would like to handle those characters specially.
Thanks again

On Fri, May 3, 2019 at 10:07 AM Tilman Hausherr <[email protected]> wrote:

> These answers may help:
>
> https://stackoverflow.com/questions/50044892/pdfbox-invisible-text-from-pdftextstripper-not-clip-path-or-color-issue
>
> https://stackoverflow.com/questions/50487520/pdfbox-2-0-invisible-text-from-pdftextstripper
>
> Tilman
>
> On 03.05.2019 at 17:02, Luca Loiodice wrote:
> > Hello,
> >
> > I would need to remove (often low quality) invisible text placed on images
> > by tools which use OCR to make searchable PDF.
> >
> > We use pdfbox ourselves to make searchable PDF... and we use
> > setRenderingMode(RenderingMode.NEITHER); when we place the text to
> > make it invisible. We also use pdfbox's text stripper to remove text from
> > PDF.
> >
> > What I am not sure about is whether there is a way for the text stripper
> > to identify the characters that have been placed as invisible and only
> > remove those in some cases.
> >
> > Thanks for your help,
> > Luca

