Hello, I'm looking for a C++ OCR library to recognize difficult-to-parse text in PDF files, and I'm wondering whether tesseract-OCR is used for this kind of thing.
Basically, some PDF files are corrupted or use non-standard encodings, and I can't parse them with existing parsing tools built in C++. What I normally do in that case is convert each PDF page, one at a time, into an image file and then re-print it as a PDF file. I then run Adobe's OCR Text Recognition function on it and go on to parse the resulting PDF file. Can tesseract be used for this kind of thing? I need an OCR library in C++ that I can incorporate into my programs, and I'm unsure whether tesseract is such a library.

Thanks
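
P.S. In case it helps to make the question concrete, here is a rough sketch of the per-page fallback I have in mind. Everything in it is an assumption for illustration: `ExtractFn`, `OcrFn`, `looks_unparseable` and `extract_with_ocr_fallback` are hypothetical names of my own, not part of tesseract or any PDF library, and the "empty or whitespace only" check is only a stand-in for whatever test actually detects a page that failed to parse. The OCR callback is the slot where I would hope to plug in an OCR library.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callbacks, not a real library API.
// ExtractFn: returns the text a normal PDF parser gets from page `page`.
// OcrFn: renders page `page` to an image and returns the recognized text.
using ExtractFn = std::function<std::string(std::size_t page)>;
using OcrFn = std::function<std::string(std::size_t page)>;

// Assumed heuristic: a page counts as unparseable when the parser
// returned nothing but whitespace (or nothing at all).
bool looks_unparseable(const std::string& text) {
    return text.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Try the normal parser on every page and fall back to OCR only for
// the pages where it produced no usable text.
std::vector<std::string> extract_with_ocr_fallback(std::size_t page_count,
                                                   const ExtractFn& extract,
                                                   const OcrFn& ocr) {
    std::vector<std::string> pages;
    pages.reserve(page_count);
    for (std::size_t i = 0; i < page_count; ++i) {
        std::string text = extract(i);
        if (looks_unparseable(text)) {
            text = ocr(i);  // page-by-page, as in my current manual workflow
        }
        pages.push_back(std::move(text));
    }
    return pages;
}
```

The point of passing the two steps in as callbacks is that the loop does not care which parser or OCR engine sits behind them, so I could swap the manual Adobe step for a library call without touching the rest of the program.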

