You're not lost - doing quite well I think.
Tesseract OCR only really reads black text on a white background, so your
processing the image to get that is good (and would fix about 1/2 of the other
problems which people report here...).
The original text is white with a black drop-shadow (to the right & down).
So, process the image so that white pixels in the original become black in the
OCR image and anything darker becomes white. (This may be what you've done
already.) This is a combination of inversion and binarization.
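
As a rough sketch of that inversion + binarization step (plain Python on a
list-of-lists grayscale image; the function name and the threshold of 200 are
only assumptions for illustration, and an image library would do the same thing
much faster):

```python
def binarize_inverted(gray, threshold=200):
    """Bright pixels (>= threshold) become black (0); everything darker becomes white (255)."""
    return [[0 if px >= threshold else 255 for px in row] for row in gray]

# Tiny grayscale sample: 255 = white glyph, 0 = black drop-shadow,
# 90 = mid-tone background (all values made up for the example).
sample = [
    [90, 255, 255, 90],
    [90, 255,   0, 90],
    [90,  90,   0, 90],
]
ocr_ready = binarize_inverted(sample)
```

Both the drop-shadow and the background end up white, so only the glyph itself
is left as black text.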
These characters are fairly blocky - due to the low res original art. If they
are still blocky after conversion to b/w then you may be able to fill in the
blocks by using an erode sequence (std. image proc ops, look them up...) to
fill the gaps somewhat. This may help the recognition rates.
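
A minimal sketch of such a sequence, assuming black text (0) on a white
background (255). With that polarity an erode (local minimum) grows the dark
strokes, and a following dilate (local maximum) shrinks them back while leaving
small gaps filled. The helper names and the 3x3 window are my own choices here,
not anything specific to a particular library:

```python
def _window_filter(img, radius, pick):
    """Apply pick (min or max) over each pixel's square neighbourhood, clipped at the edges."""
    h, w = len(img), len(img[0])
    return [
        [
            pick(
                img[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]

def erode(img, radius=1):
    """Local minimum: dark (text) regions grow."""
    return _window_filter(img, radius, min)

def dilate(img, radius=1):
    """Local maximum: dark (text) regions shrink."""
    return _window_filter(img, radius, max)

def fill_gaps(img, radius=1):
    """Erode then dilate, so gaps up to about 2*radius pixels wide in dark strokes get filled."""
    return dilate(erode(img, radius), radius)

W, B = 255, 0
# A horizontal stroke with a one-pixel gap in the middle.
stroke = [
    [W] * 9,
    [W] * 9,
    [W, W, B, B, W, B, B, W, W],
    [W] * 9,
    [W] * 9,
]
filled = fill_gaps(stroke)
```

Pixels right at the image border are handled by clipping the window, which can
distort strokes that touch the edge, so leave a small white margin around the
text.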
I can think of two approaches to address the specks at the top - either a
noise-removal image processing step or, maybe, a windowed approach to
binarization. The simplest binarization technique is the one you are already
using - a fixed threshold value for deciding black or white. A more complex
approach is to vary the threshold value based on a window of
surrounding pixel values. Research "Sauvola binarization" for details on a
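
To make the windowed idea concrete, here is a slow but self-contained sketch of
Sauvola-style thresholding. For each pixel it computes the mean m and standard
deviation s of the surrounding window and uses T = m * (1 + k * (s / R - 1)) as
the local threshold. The values k=0.2 and R=128 are commonly quoted defaults
for 8-bit images and are assumptions here, as are the function names. It also
assumes dark text on a lighter background, so white-on-dark art would need
inverting first:

```python
from math import sqrt

def sauvola_binarize(gray, radius=7, k=0.2, R=128.0):
    """Binarize with a per-pixel threshold T = m * (1 + k * (s / R - 1))."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the window around (y, x), clipped at the image edges.
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            n = len(window)
            mean = sum(window) / n
            std = sqrt(sum((v - mean) ** 2 for v in window) / n)
            threshold = mean * (1 + k * (std / R - 1))
            row.append(0 if gray[y][x] <= threshold else 255)
        out.append(row)
    return out

def make_patch(background, ink, size=5):
    """Hypothetical test image: uniform background with one dark pixel in the centre."""
    patch = [[background] * size for _ in range(size)]
    patch[size // 2][size // 2] = ink
    return patch

bright = sauvola_binarize(make_patch(200, 20), radius=2)
dim = sauvola_binarize(make_patch(100, 10), radius=2)
```

Both patches come out as a single black pixel on white, whereas a fixed
threshold of 128 would turn the whole dim patch black. A real implementation
would use integral images (or a library routine) instead of rescanning the
window for every pixel.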
It's nicer to figure out what image processing is needed without extensive
Once you know what operations/algorithms are needed then you can call them from a
(hopefully) free and easy to use (and debugged) library (e.g. OpenCV?). To
like this I use the demo program for Accusoft's ScanFix library - it lets you
with a sequence of pretty low level ops. There are probably other "image
laboratory" apps available. A paint program or viewer (Paint.NET, IrfanView)
can do a lot of these processing ops, but often not in a way that gives you
access to the details (like choice of binarization algorithm, etc.).
Finally, no OCR system is perfect - if your project requires perfect OCR then
maybe rethink it.
(Or buy a commercial OCR engine that can recognize 99+%, though still not
perfect.)