Re: [tesseract-ocr] Can tesseract be used to read a PDF and OCR it to text?

'pjfarley3' via tesseract-ocr Fri, 17 Jan 2020 15:02:18 -0800

Thanks for that link, but some research showed me that pdfgrep depends on 
the poppler libraries, which do not preserve text formatting in PDFs very 
well at all.

XPDF's version (https://www.xpdfreader.com/download.html) of pdftotext does 
the best job I have found so far, but when the PDF has erroneous or 
corrupted character map tables (as many of the PDFs I get from banks and 
utility companies do) it can't resolve all of the PDF text.
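
For anyone following along, here is a minimal sketch of driving pdftotext 
from Python with layout preservation turned on. It assumes a pdftotext 
binary is on the PATH; the helper names (pdftotext_command, extract_text) 
and the file names are hypothetical, and -layout / -enc are the only 
options used. It will not fix PDFs with bad character map tables; it only 
shows the invocation.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext run (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout
        # (columns, indentation) as closely as it can.
        cmd.append("-layout")
    # -enc selects the output text encoding.
    cmd += ["-enc", "UTF-8", str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text.

    Assumes the pdftotext executable is installed and on the PATH.
    """
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Usage would be something like extract_text("statement.pdf", 
"statement.txt"), where both file names are placeholders.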

I can use Adobe Reader to view all the text information in these PDFs even 
with such bad internal tables, but transcribing them by hand or by mouse 
highlight/copy/paste is very time-consuming.

Also, ocrmypdf's documentation of the "sidecar" option indicates that 
actual text in PDFs is not output at all, only OCR'ed text.  This defeats 
my need for reading and outputting ALL the text, hopefully with at least 
most of the textual formatting preserved.
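
One possible workaround, offered only as an untested sketch: ocrmypdf has 
a --force-ocr option that rasterizes every page and OCRs it regardless of 
any existing text, so the sidecar file should then cover all pages. The 
trade-off is that the original embedded text is replaced by OCR output, so 
accuracy and formatting depend entirely on tesseract. The helper names and 
file names below are hypothetical.

```python
import subprocess


def ocrmypdf_command(pdf_in, pdf_out, sidecar_txt):
    """Build an ocrmypdf argument list that OCRs every page (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--force-ocr",               # rasterize and OCR all pages, even those with text
        "--sidecar", str(sidecar_txt),  # write the recognized text to this file
        str(pdf_in),
        str(pdf_out),
    ]


def ocr_all_pages(pdf_in, pdf_out, sidecar_txt):
    """Run ocrmypdf; assumes the ocrmypdf executable is installed and on the PATH."""
    subprocess.run(ocrmypdf_command(pdf_in, pdf_out, sidecar_txt), check=True)
```

Whether the result is better than pdftotext on PDFs with corrupted 
character maps would have to be checked per document.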

Guess I will just have to keep looking around.

Peter

On Tuesday, January 14, 2020 at 5:42:57 AM UTC-5, JB Data31 wrote:
>
> OCRmyPDF does the job. 
>
> Linux native, but Windows available: 
>
> https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-windows.
>  
>
>
>
> 2020-01-13 7:49 UTC+01:00, 'pjfarley3' via tesseract-ocr 
> <[email protected]>: 
> > 
> > 
> > On Sunday, January 12, 2020 at 8:52:51 PM UTC-5, shree wrote: 
> >> 
> >> Tesseract reads only image files, not PDF. You can convert the PDF to 
> >> images (tif, png) and OCR those. 
> >> 
> >> Or use wrappers that use tesseract, which take a PDF and convert it to 
> >> text. Look under add-ons in the wiki. 
> >> 
> >> 
> > Thanks for that advice, I will check the wiki. 
> > 
> > Peter 
> -- 
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html> 
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3b5d75a0-83d8-4dca-8d0b-5d343d7814ac%40googlegroups.com.
