You're not lost - doing quite well I think.

Tesseract OCR only really reads black text on a white background, so your
approach of processing the image to get that is good (and would fix about
half of the other problems which people report here...)

The original text is white with a black drop-shadow (to the right & down).
So, process the image so that pixels that are white in the original become
black in the OCR image and anything darker becomes white.  (This may be what
you've done already.)  This is a combination of inversion and binarization.
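
A rough sketch of that combined step, in plain Python so it runs without any
library.  The function name, the list-of-rows image format and the threshold
of 200 are my assumptions for illustration; the right cut-off depends on your
screenshots.

```python
def invert_and_binarize(gray, threshold=200):
    """Turn bright (text) pixels black and everything darker white.

    gray: list of rows, each a list of 0-255 grayscale values.
    threshold: assumed cut-off; pixels at or above it count as white text.
    """
    return [[0 if px >= threshold else 255 for px in row] for row in gray]


# Tiny example: 255 = white glyph, 0 = black drop-shadow, 90 = background.
sample = [
    [90, 255, 255, 90],
    [90, 255,   0, 90],
    [90,  90,   0, 90],
]
ocr_ready = invert_and_binarize(sample)
# Glyph pixels are now 0 (black); shadow and background are both 255 (white).
```

Note that the drop-shadow ends up white along with the background, which is
what you want since only the glyph itself should reach the OCR engine.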

These characters are fairly blocky, due to the low-res original art.  If they
are still blocky after conversion to b/w then you may be able to fill in the
blocks by using a dilate and erode sequence (standard image processing ops,
worth looking up) to fill the gaps somewhat.  This may help the recognition
rates.
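
To show what a dilate-then-erode sequence (often called "closing") does, here
is a minimal sketch in plain Python.  It assumes a binary image where 1 means
text and 0 means background, and a 3x3 neighbourhood; the helper names are
made up for this example.  A real library such as OpenCV has its own
optimized versions of these ops.

```python
def _morph(img, rule):
    """Apply a rule to each pixel's 3x3 neighbourhood (clipped at the edges)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = 1 if rule(window) else 0
    return out


def dilate(img):
    # A pixel becomes text if any neighbour is text: strokes grow, gaps fill.
    return _morph(img, any)


def erode(img):
    # A pixel stays text only if all neighbours are text: strokes shrink back.
    return _morph(img, all)


def close_gaps(img):
    return erode(dilate(img))


# A 3x5 block of text with a one-pixel hole in the middle, with a margin.
blocky = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
closed = close_gaps(blocky)
# The hole at row 3, column 4 is filled; the outline of the block is unchanged.
```

If your OCR image is black text on white, treat the black pixels as the "1"
pixels here (or run the ops on the inverted image and invert back).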

I can think of two approaches to address the specks at the top: either a
noise-removal image processing step or, maybe, a windowed approach to
binarization.  The binarization technique you are already using is a fixed
threshold value for deciding black or white.  A more complex approach is to
vary the threshold value based on a window of surrounding pixel values.
Research "Sauvola binarization" for details on a proven algorithm.
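
For a feel of how the windowed approach differs from a fixed threshold, here
is a small, unoptimized sketch of Sauvola's rule in plain Python.  The
per-pixel threshold is T = m * (1 + k * (s / R - 1)), where m and s are the
mean and standard deviation of the surrounding window.  The window size,
k = 0.2 and R = 128 are assumed starting values, not tuned for your images,
and the function expects dark text on a light background, so invert the
grayscale image first in your case.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Binarize a grayscale image (list of rows) with a local threshold.

    Returns 0 (black) where a pixel is darker than its local threshold,
    255 (white) elsewhere.  Windows are clipped at the image edges.
    """
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                gray[ny][nx]
                for ny in range(max(0, y - half), min(h, y + half + 1))
                for nx in range(max(0, x - half), min(w, x + half + 1))
            ]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            threshold = mean * (1 + k * (math.sqrt(var) / R - 1))
            out[y][x] = 0 if gray[y][x] < threshold else 255
    return out


# A dark vertical stroke (20) on a light background (200).
stroke = [[200, 200, 20, 200, 200] for _ in range(5)]
result = sauvola_binarize(stroke, window=5)
# The stroke column comes out black, the rest white.
```

In flat regions the standard deviation is near zero, so the threshold drops
below the local mean and the region comes out white.  This loop is slow on
full-size images; a library implementation would be the practical choice once
you know the approach helps.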

It's nicer to figure out what image processing is needed without extensive
programming work.  Once you know what operations/algorithms are needed then
you can call them from a (hopefully) free, easy to use (and debugged) library
(ex. OpenCV?).  To experiment like this I use the demo program for Accusoft's
ScanFix library - it lets you process images with a sequence of pretty
low-level ops.  There are probably other "image laboratory" apps available.
A paint program or viewer (Paint.NET, IrfanView) can do a lot of these
processing ops, but often not in a way that gives you access to the low-level
details (like choice of binarization algorithm, etc.).

Finally, no OCR system is perfect - if your project requires perfect OCR then
maybe rethink it.  (Or buy a commercial OCR engine that can recognize 99+%,
though that is still not perfect.)

- Rich
