Hi,

> Am 26.10.2015 um 22:46 schrieb Timm Friedholz <[email protected]>:
> 
> Hello,
> 
> I have some PDF documents in which the glyph-Unicode character mapping is 
> destroyed so that it's not possible to search and copy the text.  In an 
> attempt to remove this restriction I've converted the PDFs to TIFF images and 
> performed OCR on them using tesseract.  Tesseract exports the recognized text 
> as PDF files in which the text is positioned transparently on top of the 
> images such that the text is searchable and selectable.
> 
> The problem is that the images in the PDF that tesseract outputs are 
> grayscale, large, high-contrast versions of the original PDFs, and I 
> would like to keep the quality and file size of the original PDFs instead.  
> Thus my idea is to copy the text objects of the OCR output to the original 
> PDFs.  To avoid interference with the old text, I've converted the original 
> PDFs to vector paths using Ghostscript:  gs -o out.pdf -dNoOutputFonts 
> -sDEVICE=pdfwrite in.pdf
> 
> Now the problem is that I'm not sure how to approach this programmatically.  
> Can I simply iterate over the pages and copy the text objects from each page 
> of one document to the corresponding page of the other document?  Which 
> operators do I need to copy if I parse it token by token?  Should I actually 
> do it directly via the PDFStreamParser class, or are there abstractions in 
> PDFBox that will make this easier?

The easiest approach might be to:
a) remove the images from the OCR'ed document
b) overlay the pages of the OCR'ed document on the original PDF using 
org.apache.pdfbox.multipdf.Overlay
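
As a rough illustration of step a) at the token level, here is a minimal sketch in plain Java (no PDFBox calls). It assumes the page content stream has already been decoded into a string, and it drops every "/Name Do" pair, which is the operator that paints an XObject such as the page image. The class name, method name, and sample stream are made up for illustration. In a real document you would take the tokens from a proper parser such as PDFStreamParser rather than splitting on whitespace, since this sketch collapses whitespace inside literal strings, ignores inline images (BI/ID/EI), and would also drop form XObjects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: removes "/Name Do" pairs
// from an already-decoded content stream given as plain text.
final class XObjectPaintStripper {

    static String stripPaintOperators(String contentStream) {
        // Naive tokenization on whitespace (an assumption of this sketch;
        // it is not a full PDF content stream parser).
        String[] tokens = contentStream.trim().split("\\s+");
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean previousIsName =
                    !kept.isEmpty() && kept.get(kept.size() - 1).startsWith("/");
            if (token.equals("Do") && previousIsName) {
                // Drop the name operand that was already kept, then skip "Do".
                kept.remove(kept.size() - 1);
                continue;
            }
            kept.add(token);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Made-up sample: an image painted via Do, followed by a text object.
        String sample = "q 612 0 0 792 0 0 cm /Im1 Do Q "
                + "BT 3 Tr /F1 12 Tf 72 700 Td <00480069> Tj ET";
        System.out.println(stripPaintOperators(sample));
    }
}
```

The text object between BT and ET is left untouched, so the filtered page keeps only the invisible OCR text, which is what step b) then lays over the original pages.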

BR
Maruan


> 
> The code for the PDF export by tesseract is here:
> 
> https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L1
> 
> Here is the line that specifies the text objects:
> 
> https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L317
> 
> Thanks.
> 
> Timm
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 

