You're not lost - you're doing quite well, I think. Tesseract OCR only really reads black text on a white background, so your approach of processing the image to get that is good (and would fix about half of the other issues people report here...)
The original text is white with a black drop shadow (offset to the right and down). So process the image so that pixels that are white in the original become black in the OCR image, and anything darker becomes white. (This may be what you've done already.) This is a combination of inversion and binarization.

These characters are fairly blocky, due to the low-res original art. If they are still blocky after conversion to b/w, you may be able to fill in the blocks with a dilate-and-erode sequence (standard image processing operations, easy to look up) that fills the gaps somewhat intelligently. This may help the recognition rate.

I can think of two approaches to address the specks at the top: either a noise-elimination image processing step or, maybe, a windowed approach to binarization. The simplest binarization technique is the one you are already using: a fixed threshold value for deciding black or white. A more complex approach is to vary the threshold based on a window of surrounding pixel values. Research "Sauvola binarization" for details on a proven algorithm.

It's nicer to figure out what image processing is needed without extensive programming work. Once you know which operations/algorithms are needed, you can call them from a (hopefully) free, easy-to-use (and debugged) library (e.g. OpenCV?). To experiment like this I use the demo program for Accusoft's ScanFix library - it lets you process images with a sequence of pretty low-level ops. There are probably other "image processing laboratory" apps available. A paint program or viewer (Paint.NET, IrfanView) can do a lot of these processing ops, but often not in a way that gives you access to the low-level details (like the choice of binarization algorithm).

Finally, no OCR system is perfect - if your project requires perfect OCR then maybe rethink it (or buy a commercial OCR engine that can recognize 99+%, though that is still not perfect...).

- Rich
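For anyone who wants to see the operations mentioned above in concrete form, here is a minimal sketch in plain Python: inversion plus a fixed threshold, a windowed (Sauvola-style) threshold, and a dilate-then-erode pass to fill gaps. It is only an illustration under some assumptions: the image is a list of rows of 0..255 grayscale values, all function names are made up for this example, and the defaults (r, k, R) are common starting values rather than tuned settings. For real images a library such as OpenCV would be far faster.

```python
from math import sqrt

BLACK, WHITE = 0, 255

def invert(img):
    """Flip brightness so white-on-dark text becomes dark-on-white."""
    return [[255 - p for p in row] for row in img]

def fixed_threshold(img, t=128):
    """Simplest binarization: one global cut-off for the whole image."""
    return [[BLACK if p < t else WHITE for p in row] for row in img]

def window(img, y, x, r):
    """Pixel values in the (2r+1)x(2r+1) window around (y, x), clipped at edges."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - r), min(h, y + r + 1))
            for i in range(max(0, x - r), min(w, x + r + 1))]

def sauvola(img, r=7, k=0.2, R=128.0):
    """Windowed binarization: threshold = m * (1 + k * (s / R - 1)),
    where m and s are the mean and standard deviation of the local window."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            vals = window(img, y, x, r)
            m = sum(vals) / len(vals)
            s = sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            t = m * (1 + k * (s / R - 1))
            row.append(BLACK if img[y][x] <= t else WHITE)
        out.append(row)
    return out

def filter_window(img, r, pick):
    """Replace every pixel by pick(window values), e.g. min or max."""
    return [[pick(window(img, y, x, r)) for x in range(len(img[0]))]
            for y in range(len(img))]

def dilate_black(img, r=1):
    """Grow black (text) regions: a pixel turns black if any neighbour is black."""
    return filter_window(img, r, min)

def erode_black(img, r=1):
    """Shrink black regions: a pixel turns white if any neighbour is white."""
    return filter_window(img, r, max)

def close_gaps(img, r=1):
    """Dilate then erode the text pixels, filling gaps up to about 2*r wide."""
    return erode_black(dilate_black(img, r), r)
```

With a grayscale image `gray` of white text on a darker background, one possible sequence would be `close_gaps(sauvola(invert(gray)))`; swapping `sauvola` for `fixed_threshold` gives the simpler fixed-threshold variant for comparison.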

