Thanks for that link, but some research showed me that pdfgrep depends on the poppler libraries, which do not preserve text formatting in PDFs very well at all.
XPDF's version (https://www.xpdfreader.com/download.html) of pdftotext does the best job I have found so far, but when the PDF has erroneous or corrupted character map tables (as many of the PDFs I get from banks and utility companies do) it can't resolve all of the PDF text. I can use Adobe Reader to view all the text information in these PDFs even with such bad internal tables, but transcribing it by hand or by mouse highlight/copy/paste is very time consuming.

Also, ocrmypdf's documentation of the "sidecar" option indicates that actual text in PDFs is not output at all, only OCR'ed text. This defeats my need for reading and outputting ALL the text, hopefully with at least most of the textual formatting preserved.

Guess I will just have to keep looking around.

Peter

On Tuesday, January 14, 2020 at 5:42:57 AM UTC-5, JB Data31 wrote:
>
> OCRmyPDF do the job.
> Linux native, but windows available :
> https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-windows.
>
> 2020-01-13 7:49 UTC+01:00, 'pjfarley3' via tesseract-ocr <[email protected]>:
> >
> > On Sunday, January 12, 2020 at 8:52:51 PM UTC-5, shree wrote:
> >>
> >> Tesseract reads only image files, not pdf. You can convert PDF to image
> >> (tif, png) and OCR those.
> >>
> >> Or use wrappers that use tesseract.which take a PDF and convert to text.
> >> Look under add-ons in wiki.
> >>
> > Thanks for that advice, I will check the wiki.
> >
> > Peter
> >
> > --
> > You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> > To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/de8cf032-eb3d-41df-8127-805e84334909%40googlegroups.com.
> --
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
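
One way to approach the "ALL the text" goal is a hybrid: keep pdftotext's layout-preserving output for pages where the embedded text resolves, and send only the pages that come out unreadable through the convert-to-image-and-OCR route. The Python sketch below covers just the detection half. It is an illustration under assumptions, not a tested workflow: the helper names (`extract_pages`, `split_pages`, `suspect_ratio`, `pages_needing_ocr`) are hypothetical, the 5% threshold is arbitrary, and the idea that a bad character map shows up as U+FFFD, private-use, or control characters is only a heuristic (a broken map can also yield plausible-looking but wrong letters, which this would miss). It also assumes a `pdftotext` on the PATH that accepts `-layout`, `-enc UTF-8`, and `-` for stdout, and that separates pages with form feeds.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into pages; each page ends with a form feed."""
    pages = text.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def extract_pages(pdf_path):
    """Run pdftotext in layout mode and return one string per page.

    Assumes a pdftotext binary on PATH; not exercised in this sketch.
    """
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"]
    out = subprocess.run(cmd, check=True, capture_output=True).stdout
    return split_pages(out.decode("utf-8", errors="replace"))


def is_suspect(ch):
    """Heuristic (an assumption): characters a broken character map may produce."""
    code = ord(ch)
    return ch == "\ufffd" or 0xE000 <= code <= 0xF8FF or code < 0x20


def suspect_ratio(page_text):
    """Fraction of non-whitespace characters that look unresolved.

    A page with no visible text at all counts as fully suspect.
    """
    visible = [c for c in page_text if not c.isspace()]
    if not visible:
        return 1.0
    return sum(1 for c in visible if is_suspect(c)) / len(visible)


def pages_needing_ocr(pages, threshold=0.05):
    """Return 1-based numbers of pages whose suspect ratio exceeds threshold."""
    return [n for n, text in enumerate(pages, start=1)
            if suspect_ratio(text) > threshold]
```

Pages flagged by `pages_needing_ocr(extract_pages("statement.pdf"))` (the file name is a placeholder) would then be rendered to images and passed to Tesseract, while the remaining pages keep the pdftotext output with its layout.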

