I could imagine a workflow where you take the original PDF (either one with electronic text or an image-only PDF) and run it through Tika with Tesseract. You get back the tokens with their bounding boxes and use that data to find your text. Then go back to the image version of the PDF (which Tika creates as well) and overlay the black boxes…
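The "find your text" step could be sketched roughly as below. This is only an illustration, assuming the OCR output is hOCR in which each word is a span with class `ocrx_word` and a `title` attribute carrying `bbox x0 y0 x1 y1`, which is the layout Tesseract's hOCR output uses. The names `HocrWordParser` and `find_redaction_boxes` and the sample markup are made up for the example; they are not part of Tika or Tesseract.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def find_redaction_boxes(hocr, terms):
    """Return the bounding boxes of words that match any of the given terms."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    wanted = {term.lower() for term in terms}
    return [
        bbox
        for text, bbox in parser.words
        # Ignore surrounding punctuation and case when comparing.
        if text.strip(".,;:!?()\"'").lower() in wanted
    ]


# Hypothetical hOCR fragment, only for demonstration.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 200'>
 <span class='ocr_line' title='bbox 30 90 400 120'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Account</span>
  <span class='ocrx_word' title='bbox 104 92 230 116; x_wconf 91'>12345678,</span>
  <span class='ocrx_word' title='bbox 238 92 300 116; x_wconf 96'>closed</span>
 </span>
</div>
"""

boxes = find_redaction_boxes(SAMPLE_HOCR, ["12345678"])
```

Each returned box is in pixel coordinates of the page image that was OCR'd, so it could then be filled in on that same image, for example with Pillow's `ImageDraw.Draw(image).rectangle(box, fill="black")`. Multi-word phrases would need extra logic to match across consecutive words; this sketch only handles single tokens.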
Your output would be either an image or an image-only PDF…

I think, contrary to what Tim said, we actually get the hOCR coordinates for both image-only PDFs and PDFs with underlying electronic text, as the PDFs are converted page by page to images before Tesseract gets them, IIUC.

Eric

> On Nov 25, 2019, at 10:02 AM, Tim Allison <[email protected]> wrote:
>
> Hi Furkan,
>
> First, are you processing PDFs or actual image files? If PDFs, be careful
> about blacking out images because there may be some record of the underlying
> text in the file, and while a user might not be able to see the sensitive
> information, that information may be available for inquiring minds.
>
> If PDFs, are these PDFs that are image-only or is there underlying
> electronic text. If image-only, you could use the hocr output from
> tesseract, which reports coordinates in an html output file.
>
> Now, if there is underlying text, we aren't currently extracting text
> positions from PDFs...although we could.
>
> @Eric Pugh <[email protected]>, recommendations?
>
> Cheers,
>
> Tim
>
> On Mon, Nov 25, 2019 at 7:39 AM Furkan KAMACI <[email protected]> wrote:
> Hi All,
>
> I want to black out some particular texts at image (similar to described at
> here:
> https://helpx.adobe.com/acrobat/using/removing-sensitive-content-pdfs.html)
>
> I know that I can find tokens at image via Tika. However, I need the
> coordinates of a found token at image to automatically black out specific
> texts.
>
> How can I achieve this?
>
> Kind Regards,
> Furkan KAMACI

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
