You're not lost - you're doing quite well, I think.

Tesseract OCR only really reads black text on a white background, so your
approach of processing the image to get that is good (and would fix about
half of the other issues which people report here...)

The original text is white with a black drop-shadow (offset to the right and
down). So, process the image so that pixels that are white in the original
become black in the OCR image and anything darker becomes white. (This may be
what you've done already.) This is a combination of inversion and
binarization.
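
Here is a minimal sketch of that combined step, in plain Python on a list of
rows of 0-255 grayscale values. The function name and the threshold of 200 are
my own assumptions for illustration; the real cutoff would need tuning against
your image, and a library routine would be faster on full-size images.

```python
def invert_and_binarize(gray, threshold=200):
    """Map bright (text) pixels to black (0) and everything darker to white (255).

    gray is a list of rows of 0-255 grayscale values. threshold is an assumed
    cutoff, not a value taken from the original image.
    """
    return [[0 if px >= threshold else 255 for px in row] for row in gray]


# Tiny example: white text (255), black drop-shadow (0), mid-gray background (90).
sample = [
    [90, 255, 255, 90],
    [90, 255,   0, 90],
    [90,  90,   0, 90],
]
result = invert_and_binarize(sample)
```

Because only the brightest pixels are kept as text, the black drop-shadow ends
up white along with the background, which is what you want here.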

These characters are fairly blocky, due to the low-res original art. If they
are still blocky after conversion to b/w, then you may be able to fill in the
gaps between the blocks by using a dilate-then-erode sequence (standard image
processing ops, together often called a morphological "closing"; look them
up...) to fill the gaps somewhat intelligently. This may help the recognition
rates.
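
To show what dilate followed by erode does, here is a rough sketch in plain
Python on a boolean mask where True means "text pixel". The 3x3 neighbourhood,
the helper names and the border handling (out-of-image cells are simply
skipped) are assumptions I made to keep it short. If you use OpenCV, I believe
the equivalent is cv2.morphologyEx with cv2.MORPH_CLOSE, but note that OpenCV
grows the bright regions, so apply it while the text is still white or invert
around it.

```python
def _neighbourhood(mask, r, c):
    """Values in the 3x3 neighbourhood of (r, c); out-of-image cells are skipped."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[rr][cc]
        for rr in range(max(r - 1, 0), min(r + 2, rows))
        for cc in range(max(c - 1, 0), min(c + 2, cols))
    ]


def dilate(mask):
    """A pixel becomes text if any pixel in its neighbourhood is text."""
    return [[any(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def erode(mask):
    """A pixel stays text only if every pixel in its neighbourhood is text."""
    return [[all(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def close_gaps(mask):
    """Dilate, then erode: small gaps are filled, overall stroke size is kept."""
    return erode(dilate(mask))


def to_mask(lines):
    return [[ch == "#" for ch in line] for line in lines]


def to_text(mask):
    return ["".join("#" if v else "." for v in row) for row in mask]


# A stroke with a one-pixel gap in the middle.
blocky = [
    ".........",
    ".........",
    "..##.##..",
    ".........",
    ".........",
]
closed = to_text(close_gaps(to_mask(blocky)))
```

In this example the gap in the middle of the stroke is filled while the stroke
keeps its original length and thickness.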

I can think of two approaches to address the specks at the top - either a
noise elimination image processing step or, maybe, a windowed approach to
binarization. The simplest binarization technique is the one you are already
using - a fixed threshold value for deciding black or white. A more complex
approach is to vary the threshold value based on a window of surrounding pixel
values. Research "Sauvola binarization" for details on a proven algorithm.
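
As a starting point, here is a slow but readable sketch of Sauvola-style
windowed thresholding in plain Python. The threshold for each pixel is
mean * (1 + k * (std / R - 1)), computed over a window around that pixel. The
window size and the values of k and R below are assumed starting points that
would need tuning, and the formula expects dark text on a lighter background,
so the white-text original is inverted first.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Return 0 (text) where a pixel is below its local threshold, else 255.

    gray: list of rows of 0-255 values with dark text on a lighter background.
    window, k and R are tuning parameters; the defaults here are assumptions.
    """
    rows, cols = len(gray), len(gray[0])
    half = window // 2
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect the window around (r, c), clipped at the image border.
            vals = [
                gray[rr][cc]
                for rr in range(max(r - half, 0), min(r + half + 1, rows))
                for cc in range(max(c - half, 0), min(c + half + 1, cols))
            ]
            mean = sum(vals) / len(vals)
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            threshold = mean * (1.0 + k * (std / R - 1.0))
            out_row.append(0 if gray[r][c] < threshold else 255)
        out.append(out_row)
    return out


# White text pixel (235) on a dark background (55), as in the original art.
original = [[55] * 5 for _ in range(5)]
original[2][2] = 235
inverted = [[255 - px for px in row] for row in original]
binary = sauvola_binarize(inverted, window=3)
```

This recomputes every window from scratch, so it is only practical for small
images. A library implementation would be the better choice once you have
confirmed that the technique helps.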

It's nicer to figure out what image processing is needed without extensive
programming work. Once you know what operations/algorithms are needed, you
can call them from a (hopefully) free, easy to use (and debugged) library
(e.g. OpenCV?). To experiment like this I use the demo program for Accusoft's
ScanFix library - it lets you process images with a sequence of pretty low
level ops. There are probably other "image processing laboratory" apps
available. A paint program or viewer (Paint.NET, IrfanView) can do a lot of
these processing ops, but often not in a way that gives you access to the low
level details (like the choice of binarization algorithm, etc.).

Finally, no OCR system is perfect - if your project requires perfect OCR then
maybe rethink it (or buy a commercial OCR engine that can recognize 99+%,
though that is still not perfect...).

- Rich
