I could imagine a workflow where you take the original PDF (either one with 
embedded text or an image in a PDF) and run it through Tika with Tesseract. 
You then get back the tokens with their bounding boxes and use that data to 
find your text. Then go back to the image version of the PDF (which Tika 
creates as well) and overlay the black boxes…

Your output would be either an image or an image-only PDF…
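
To make the bounding-box step concrete, here is a rough sketch in Python of 
reading word boxes out of hOCR and blacking them out. It assumes you already 
have the hOCR text and a rendered page image on disk; how you get those out of 
Tika is not shown. hOCR marks each word as a span with class "ocrx_word" and 
puts "bbox x0 y0 x1 y1" in the title attribute. The sample markup, the function 
names, and the use of Pillow for drawing are my own assumptions for 
illustration, not anything Tika provides.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def find_boxes(hocr, sensitive):
    """Return the boxes of words that match any sensitive token (case-insensitive)."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    wanted = {s.lower() for s in sensitive}
    return [box for text, box in parser.words
            if text.strip(".,;:!?").lower() in wanted]


def black_out(image_path, boxes, out_path):
    """Draw filled black rectangles over the boxes (hypothetical helper, needs Pillow)."""
    from PIL import Image, ImageDraw  # assumption: Pillow is installed
    with Image.open(image_path) as img:
        page = img.convert("RGB")
        draw = ImageDraw.Draw(page)
        for box in boxes:
            draw.rectangle(box, fill="black")
        page.save(out_path)


# Made-up hOCR fragment, only for illustration.
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 600 200'>
<span class='ocr_line' title='bbox 30 90 400 120'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Account</span>
<span class='ocrx_word' id='word_1_2' title='bbox 104 92 210 116; x_wconf 91'>12345678</span>
</span></div>"""

boxes = find_boxes(SAMPLE_HOCR, ["12345678"])
print(boxes)
```

This only matches single tokens; a multi-word phrase would need the boxes of 
adjacent words merged. The boxes are in pixels of the image that was OCR'd, so 
the black boxes have to be drawn on that same rendering (same DPI), and saving 
the result as an image keeps the original text out of the output.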

I think, contrary to what Tim said, we actually get the hOCR coordinates for 
both image-only PDFs and PDFs with underlying electronic text, as the PDFs are 
converted page by page to an image before Tesseract gets them, IIUC.

Eric


> On Nov 25, 2019, at 10:02 AM, Tim Allison <[email protected]> wrote:
> 
> Hi Furkan,
> 
>   First, are you processing PDFs or actual image files?  If PDFs, be careful 
> about blacking out images because there may be some record of the underlying 
> text in the file, and while a user might not be able to see the sensitive 
> information, that information may be available for inquiring minds.
> 
>   If PDFs, are these PDFs image-only, or is there underlying 
> electronic text?  If image-only, you could use the hOCR output from 
> Tesseract, which reports coordinates in an HTML output file.
> 
>   Now, if there is underlying text, we aren't currently extracting text 
> positions from PDFs...although we could.  
> 
> @Eric Pugh <mailto:[email protected]>, recommendations?
> 
>   Cheers,
> 
>                       Tim
> 
> On Mon, Nov 25, 2019 at 7:39 AM Furkan KAMACI <[email protected]> wrote:
> Hi All,
> 
> I want to black out some particular text in an image (similar to what is 
> described here: 
> https://helpx.adobe.com/acrobat/using/removing-sensitive-content-pdfs.html)
> 
> I know that I can find tokens in an image via Tika. However, I need the 
> coordinates of a found token in the image to automatically black out 
> specific text.
> 
> How can I achieve this?
> 
> Kind Regards,
> Furkan KAMACI

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
